Improved Estimation of Correlation in Microarray Data Analysis

Authors

  • Marc Sobel
  • Bud Mishra
Abstract

In the original work on clustering by Eisen et al., in which they analyzed one of the most heavily re-analyzed microarray datasets of gene expression, the authors claimed to have “found in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function.” However, they measured similarity between any pair of genes using a somewhat non-standard definition of the correlation coefficient instead of Pearson’s correlation coefficient, an unbiased estimator. Eisen et al.’s paper is silent about how drastically the clusters of genes would change if the definition of similarity were changed back to Pearson’s correlation coefficient, or to any other member of a larger family of estimators lying between Pearson’s and Eisen et al.’s, indexed by a “shrinkage coefficient” taking a value between 0 and 1. Their approach raised several questions: what is the best shrinkage coefficient, how can it be computed, and can it be computed quickly in closed form? Mishra and his students answered these questions in a recent paper, but left for future research the question of whether there is an even wider family of similarity metrics, even if such metrics may not be computable in closed form. We take up this problem in this paper and suggest how to compute a somewhat better similarity metric using an MCMC algorithm, how to define an intuitively clear Bayesian risk assessment, and finally how to interpret the empirical results obtained through simulation.

∗To whom correspondence should be addressed. E-mail: [email protected]

1 Problem Formulation

In post-genomic biology, clustering genes by their similarity has occupied many biologists and statisticians for almost half a decade.
Although the relevance of such a succinct representation to understanding fundamental principles of biology is yet to be firmly established, there is much less disagreement that the resulting reorganization of the data may add clarity to subsequent bioinformatic analysis and experiment design, e.g., interpreting ChIP-Chip experiments or looking for cis-regulatory elements. There is now a rapidly growing body of genomics literature devoted to clustering, co-clustering, bi-clustering, etc., with random or designed sets of conditions and different definitions of similarity, yet much less attention has been paid to deriving a statistically robust definition of gene similarity. In the usual setting, starting with data from a series of expression microarray experiments, one wishes to estimate the similarity between the expression levels of a pair of genes, because it is frequently indicative of a functional relationship between them. Highly correlated transcriptomic behavior of a group of genes often suggests the presence of causal relationships, usually through common regulatory mechanisms. Identifying such potential relationships is of primary importance (1) in understanding and modeling microarray and other genetic data, and (2) in inferring functional relationships crucial to predictive and other kinds of inference. These identifications frequently arise from partitioning genes into closely related groups, called clusters. Traditionally, algorithms for cluster analysis of expression data are based on statistical properties of gene expression and organize genes according to similarities between their expression patterns (see [1]).
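
To make the family of estimators mentioned in the abstract concrete, the following is a minimal sketch of a correlation-like similarity indexed by a shrinkage coefficient between 0 and 1. It rests on an assumption that the text above does not spell out: each expression profile is offset by gamma times its sample mean, so that gamma = 1 gives Pearson's correlation coefficient and gamma = 0 gives an uncentered correlation with zero offset. The function name `shrinkage_similarity` and the two toy profiles are hypothetical, and the sketch does not attempt the optimal choice of gamma or the MCMC-based metric proposed in the paper.

```python
import math


def shrinkage_similarity(x, y, gamma):
    """Correlation-like similarity between two expression profiles.

    Assumption for illustration: each profile is offset by gamma times its
    sample mean. gamma = 1 centers on the mean (Pearson's coefficient);
    gamma = 0 uses a zero offset (uncentered correlation).
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")

    # Shrink each sample mean toward zero by the factor gamma.
    offset_x = gamma * sum(x) / len(x)
    offset_y = gamma * sum(y) / len(y)

    dx = [a - offset_x for a in x]
    dy = [b - offset_y for b in y]

    # Normalize the cross-product by the root sums of squares.
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0.0:
        raise ValueError("similarity is undefined for a constant offset profile")
    return sum(a * b for a, b in zip(dx, dy)) / norm


# Toy expression profiles for two genes across four conditions (made up).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 1.0, 4.0, 3.0]

for gamma in (0.0, 0.5, 1.0):
    print(gamma, shrinkage_similarity(gene_a, gene_b, gamma))
```

On these toy profiles the two endpoints of the family differ noticeably, which is the kind of sensitivity to the similarity definition that the abstract raises for the resulting clusters.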

Publication date: 2004